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Abstract 



o 

Q^) ' In this paper, we propose an adaptive algorithm that iteratively updates both the weights and 

component parameters of a mixture importance sampling density so as to optimise the perfor- 
mance of importance sampling, as measured by an entropy criterion. The method, called M-PMC, 
r/3 , is shown to be applicable to a wide class of importance sampling densities, which includes in par- 

ticular mixtures of multivariate Student t distributions. The performance of the proposed scheme 
is studied on both artificial and real examples, highlighting in particular the benefit of a novel 
J> ■ Rao-Blackwcllisation device which can be easily incorporated in the updating scheme. 

\ Keywords: Importance sampling, Adaptive Monte Carlo, Mixture model, Entropy, Kullback- 

■ Lciblcr divergence, EM algorithm, Population Monte Carlo. 

2 ■ 1 Introduction 

In recent years, there has been a renewed interest in using Monte Carlo procedures based on Impor- 
tance Sampling (abbreviated to IS in the following) for inference tasks. Compared to alternatives 
^ i such as Markov Chain Monte Carlo methods, the main appeal of IS procedures lies in the possibility 
of developing parallel implementations, which becomes more and more important with the general- 
isation of multiple core machines and computer clusters. Importance sampling procedures are also 
attractive in that they allow for an easy assessment of the Monte Carlo error (provided trustworthy 
estimates of the variance can be produced). As a consequence, it is therefore easier to construct learn- 
ing mechanisms in IS settings because of this ability to compare the errors. In many applications, 
the fact that IS procedures may be tuned — by choosing an appropriate IS density — to minimise the 
approximation error for a specific function of interest is also crucial. On the other hand, the short- 
comings of IS approaches are also well-known, including a poor scaling to highly multidimensional 
problems and an acute sensitivity to the choice of the IS density combined with the fact that it is 
impossible to come up with a universally efficient IS density. While there exist a wide variety of 
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solutions in the literature (see, e.g. iRobert and Casellal . 120041 . Chapter 14), this paper concentrates 
on the construction of adaptive importance sampling schemes in which the IS density is gradually 
improved based on the outcome of previous Monte Carlo draws. 



W hile the method proposed here can be traced back to authors such as lWestl (|1992| ) or lOh and Berger 
(|l993h . it is closely related to the so-called Population Monte Carlo (henceforth abbreviated to PMC) 
approach — in the sense of an iterated simulation of importance samples and in oppositi on to Markov 
Chain Monte Carlo simulation that only produces a point at a time — int roduced by ICappe et al 



(120041') . We briefly review the PMC approach, following the exposition of ICappe et al.l (J2004J) and 



ly re 

Douc et al.l (|2007al lb1) . in order to highlight the differences with the present work. In PMC, a sample 
(Xi, . . . ,Xn) approximately distributed from tt, is repeatedly perturbed stochastically using an ar- 
bitrary Markov transition kernel q(x, x') so as to produce a new sample (X[, . . . , X' N ). Conducting a 
resampling step based on the IS weights Ui = n(X^)/q(Xi,X' i ), it is then possible to produce a new 
unweighted sample (X\, . . . , Xn) that also constitutes an approximation to the target distribution tt. 
Adaptivity in PMC was achieved by considering a transition kernel q consisting of a mixture of fixed 
transition kernels 



q a (x,x') 



D 

^a d q d {x,x') 

d=l 



D 



i 



(i) 



d=l 



whose weights ct\, . . . ,ao are tun ed adaptive l y, alon g the iteration of the PMC algorithm. The 
adaptation procedure proposed by iDouc et al.l (|2007al ). termed D -kernel PMC, aims at minimising 
the deviance or entropy criterion between the kernel q a and the target tt, 



(B(7r,q a )=^ [D(ir\\q a (X,-))] 



(2) 



where D(p\\q) = j log{p(x)/q(x)} p(x)dx denotes the Kullback-Leibler divergence (also called relative 
entropy), and where the expectation is taken under the target distribution X ~ tt since the kernels 
qd(x,x') depend on the starting value x. In the sequel, we refer to the criterion in ([2]) as the entropy 
cr iterion since it is obviously re lated to the perf o rmance measure used in the cross-entropy method 
of Rubinstein and Kroese ( 20041 ) . In Douc et al. I (|2007bh . a version of this algorithm was developed 
to minimise the asymptotic variance of the IS procedure, for a specific function of interest, in lieu of 
the entropy criterion. 

A major limitation in the approaches of Douc et al. ( 2007al lbl) is that the proposal kernels q d 
remain fixed over the iterative process while only the mixture weights a d get improve d. In the 



present contribution, we remove this limitation by extending the framework of lDouc et al 
allow for the adaption of IS densities of the form 



(l2007ah 



to 



D 

£ 

d=l 



a d q d (x;0 d ) , 



(3) 



with respect to both the weights a d and the internal parameters 9 d of the component densities. The 
proposed updating mechanism is quite similar to the EM algorithm with the E-step replaced by IS 
computations. As demonstrated through the example considered in Section this adaptive scheme 
is applicable to very general families of latent-data IS densities. A possible drawback of adapting 
the internal parameters 9 d of the component densities is that it sometimes raises challenging robust- 
ness issues, particularly when (multidimensional) scaling parameters are tuned. We thus propose a 
Rao-Blackwellisation scheme that empirically appears to be very efficient while inducing a modest 
additional algorithmic complexity. 



Note again that we consider here the generic entropy cr iterion of IDouc et al 



than the function-specific variance minimisation objective of IDouc et al.1 (|2007bl ). 



(|2007al 1 rather 
This choice is 

motivated by the recognition that in most applications, the IS density is expected to perform well 
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for a range of typical functions of int erest rather than fo r a specific target function h. In addition, 
the generalisation of the approach of Douc et al. ( 2007bl ) to a class of mixture IS densities that are 



parameterised by more than the weights remains an open question (see also Section [5]). A second 
remark is that in contrast to the previously cited works and as obvious in equation ([3]), we consider 
in this paper only "global" independent IS densities. Thus, the proposed scheme is based on genuine 
iterated importance sampling, contrary to what happens when using more general IS transition kernels 
as in (HJ). Obviously, resorting to moves that depend on the current sample is initially attractive 
because it allows for some local moves as opposed to the global exploration required by independent 
IS densities. However, the fact that the entropy criterion in ^) is a global measure of fit tends to 
modify the parameters of each transition kernel depending on its average performance over the whole 
sample, rather than locally. In addition, structurally imposing a dependence on the points sampled at 
the previous iteration induces some extra- variability which can be detrimental when more parameters 
are to be estimated. 

The paper is organised as follows: In Section [2j we develop a generic upd ating schem e for in - 
dependent IS mixtures ([3]), establishing that the integrated EM argument of Douc et al. ( 2007a! ) 



remains valid in our setting. Note once again that the integrated EM update mechanism we un- 
cover in this paper is applicable to all missing data representations of the proposal kernel, and not 
only to finite mixtures. In Section (3j we consider the case of Gaussian mixtures which naturally 
extend the c ase of mixtures of Gaussian random walks with fixed covariance structure considered in 
Douc et all (j2007al lbh. In Section H we show that the algorithm also applies to mixtures of multi- 



variat e t distributions with the continuous scale mixing representation used in lPeel and McLachlan 



( 2000| ). Section [5] provides some conclusive remarks about the performance of this approach as well 



as possible extensions. 



2 Adapting the Importance Sampling Density 



2.1 The M-PMC Algorithm 

When considering independent mixture IS densities of the form ([3]), the entropy criterion (£ defined 
in ([2|) reduces to the Kullback-Leibler divergence between the target density tt and the mixture q( a ,e)' 



£(n,q( a ,9)) = D(n\\q iat6) ) = / log — — 



ir(x) 



adqd(x;0 d ) 



7r(a;)da 



(4) 



As usual in applications of the IS methodology to Bayesian inference, the target density tt is known 
only up to a normalisation constant and we will focus on a self-normali sed version of IS that solely 
requires the availability of an unnormalised version of tt (jGewekd . 1 19891) . As a side co mment, note 
that while £(tt, q^ a fi)) is a convex function of the weights a±, . . . , od (jDouc et al.l . l2007al ). it generally 
fails to be so when also optimising with respect to the component parameters 9±, . . . , 6d- Given that 
minimising in (a, 8) is equivalent to maximising 



(5) 



we are facing a task that formally resembles standard mixture maximum likelihood estimation but 
with an integration with respect to tt replacing the empirical sum over observations. 

This analogy suggests that it is possible to maximise the entropy criterion in using an approach 
based on the principle of the EM algorithm and, in particular, the use of the augmented mixture 
representation (involving the indicator variables associated with each component of the mixture). 
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Before providing the details of the derivation in Section f2.2l we first state below the proposed adaptive 
IS algorithm which we will refer to as M-PMC (for Mixture PMC) in the following. Let (-^i,t)i<i<jv 
and {a t,N , 9 t,N ) denote, respectively, the IS sample and the estimated mixture parameters at the 4-th 
iteration of the algorithm. 

Algorithm 1. (M-PMC Algorithm) At iteration t, 

1. Generate a sample (Xi jt ) from the current mixture IS proposal ([3]) parameterised by {a t,N , 9 t,N ) 
and compute the normalised importance weights 



N 



V^D t,N lv nt>Ns 



s^u t,N i v /)^,^ r ^ 



and the mixture posterior probabilities 



D 



£ i 



(6) 



(7) 



for i = 1, . . . , N and d = 1, . . . , D. 
2. Update the parameters a and 6 as 

N 



a 



t+l,N 



for d = 1, 



qt+l,N 



D. 



Y J ^t Pd {X i ,f,a t ' N ^ N ) , 

i=l 

' N 

£ mm a^, O log {a (X M ; } 



arg max 



i=l 



(8) 



The convergence of the algorithm may be monitored by computing the so-called normalised per- 
plexity exp(H t ' N )/N, where H*' N = - u> ijt log * is the Shannon entropy of the normalised IS 
weights. The normalised perplexity provides an estimate of exp[— <£(tt, q^ a t,N ^.jvO] and, for sufficiently 
large iV, it is non-decreasing with t. 



2.2 Detailed Derivation 
Integrated Updates 

Starting from ([5]), assume for the moment that integration with respect to ir is feasible. In order 
to update the parameters of the independent IS density ([3]), we will take advantage of the latent 
variable structure that underlines the objective function ([5|). The resulting algorithm — still theo- 
retical at this stage as it involves integration with respect to ir — may be interpreted as an inte- 
grated EM (Expectation-Maximisation) scheme that we now describe. Let a* = (a\, . . . , a^) and 
6 t = [Q\, . . . , Ojj) denote, respectively, the mixture weights and the component parameters at the i-th 
iteration of this integrated EM algorithm. 

As usual in mixtures, the latent variable Z is the component indicator, with values in {1, . . . , D} 
such that the joint density / of x and z satisfies 

f(z) = a z and f(x\z) = q z (x;0 z ), 
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which produces ([3]) as the marginal in x. As in the standard EM algorithm, we can then take 
advantage of this latent variable representation. Since the joint density of X and Z is a z q z (x; 9 X ), the 
expectation corresponding to the E step of the EM algorithm is the expected complete log-likelihood, 
namely, at iteration t of our algorithm, 



E; 



Ef at>flt) {Iog (a z q z (X;9 z ))\X} 



where the inner expectation is computed under the conditional distribution of Z in the mixture model 
given the current value (a*, 9 l ) of the parameters, i.e. 



D 



f(z\x) = a*g*(x;0*) / J] a d q d (x; 0* d ) 



d=l 



while the outer expectation is under the distribution X ~ tt. 

The proposed updating mechanism then corresponds to setting the new parameters (a t+1 ,9 t+1 ) 
equal to 



(a 



t+1 at+l- 



argmax E^ Ef Q , flt) {log(a z q z (X; B Z ))\X) 

(a,9) L v ' ; 



0) 



as in the regular EM estimation of the parameters of a mixture, except for the extra expectation over 
X. It is straightforward to check that the convexity argument used for the EM algorithm also applies 
in this setup and, hence, that (C(tt, e*)))t>i is a non-decreasing sequence. Setting 



D 



Pd(X; a, 9) = a d q d (X; 9 d ) / ^ aeqe(X; B£) , 

' i=i 

the maximisation program in ([9]) reduces to 



a t+1 = arg max E^ 



x 



D 



9 t+1 = arg max E^ 



x 



^p d (X;a t ,^)log(a d 

.d=l 
D 

^ W (X;a*,^)log( % (X;0 d )) 



d=l 



where the first maximisation to be carried out under the constraint that ^2 d= ± cn d +1 = 1. Hence, 

q* +1 =E; Y [pdiX-a 1 ^ 1 )] , (10) 
d +1 = argmaxE^ [p d (X; a\ 9*) log(%(X; d ))] . (11) 



As in the regular mixture estimation problem, the resolution of this maximisation program ul- 
timately depends on the shape of the density q d - If q d belongs to an exponential family, it is easy 
to derive a closed-form solution for (jlip . which however involves expectations under tt. Section [3] 
provides an illustration of this fact in the Gaussian case, while the non-exponential Student's t case 
is considered in Section |H 



Approximate Updates 

To make the previous algorithm practical, adaptivity must be achieved by updating the parameters 
based on the previously simulated IS sample. We thus start the algorithm by arbitrarily fixing the 
mixture parameters (a 1 , 9 1 ) and we then sample from the resulting proposal a d Qd(x; 9 d ) to obtain 
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our initial sample pQ ) i)x<j<iV; associated with the latent variables (^i,i)i<i<iv that indicate from 
which component of the mixture the corresponding (-X"i ; i)i<j<jv have been generated. Prom this 
stage, we proceed recursively. Starting at iteration t from a sample {Xi :t )i<i<N , associated with the 
latent variables (Zi t )i<i<N an d the normalised IS weights (&it)i<i<N defined in Q, we denote by 



t+l,N 



gt+1 



the updated value of the mixture parameters. 
To approximate (fTU|) and (fTTj) . iDouc et al.l (|2007al ) proposed the following update rule: 



N 



O'., 



t+l,N 



nt+l,N 
1 d 



} uJi >t l{Zi )t = d} , 

N 

^2u)i )t l{Z it t = d}log{g d (x i)t ;9 t ( i N '\\ 



arg max 



i=l 



(12) 



The computational cost of this update is of order N whatever the number D of components is, since 
the weight and the parameter of each component are updated based only on the points that were 
actually generated from this component. However, this observation also suggests that (|12p may be 
highly variable when TV" is small and/or D becomes larger. To make the update more robust, we here 
propose a simple Rao-Blackwellisation step that consists in replacing l{Zi t = d} with its conditional 
expectation given X^t, that is, pd [Xi,t\ ot t)N , 9 t,N ) defined in ((7J). The resulting parameters update is 
given by (jHJ), which we selected for Algorithm [TJ 

Examining (J7|) indicates that the evaluation of the posterior probabilities pd{Xi t t; a t,N \9 t,N ) does 
not represent a significant additional computation cost, given that the denominator of this expression 
has already been computed when evaluating the IS weights according to (jSJ). The most significant 
difference between dHJ) and (|12p is that, with the former, all points contribute to the updating of the 
tZ-th component, for an overall cost proportional to D x N. Note however that in many applications 
of interest, the most significant computational cost is associated with the evaluation of ir — which is 
performed exactly N times per iteration — so that the cost of the update is mostly negligible, even 
with the Rao-Blackwellised version. 



Convergence of the M-PMC Algorithm 

Con vergence of the estim ated parameters as iV increases can be established using the same approach 
as m IDouc et al.l(|2nn7al lbh. rel ying mainly on the convergence property of triangular arrays of random 



variables (see Theorem A.l in lDouc et al.l . l2007al ). For the Rao-Blackwellised version, assuming that 
for all 0's, Tv(qd(-]0d) = 0) = 0, for all a's and 9's, pd{-; a, 9) log qd(-, 9d) € L 1 {-k), and some (uniform 
in x) regularity conditions on qd{x; 9) viewed as a function of 9, yield 

t+l,N JP t+1 „t+l,N P ot+1 

when iV goes to infinity. Note that we do not expand on the regularity conditions imposed on qd 
since, for the algorithm to be efficient, we definitely need a closed-form expression on the parameter 
updates. It is then easier to deal with the convergence of the approximation of these update formulas 
on a case-by-case basis, as will be seen in the Gaussian example of Section [3j 

As a practical criterion for monitoring the convergence of the algorithm we recommend comput- 
ing the normalised perplexity exp(H t,N )/N (see Algorithm [T]) and to interrupt adaptation when it 
stabilises and/or becomes sufficiently close to 1. Note that in referring to exp(H t,N ) (exponential of 
the Shannon entropy expressed in nat) as the perplexity, we follow the terminology in use in the field 
of natural language processing. The connection between the perplexity and the entropy criterion ([2]) 
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is revealed by writing 

/ P it ( nr\ \ I f 

■^"unn (x)dx , (13) 



exp [-£(71-^0))] = exp (^J - log J""™^ ir(x)dx^j (^j 



where 7r unn refers to the unnormalised version of tt which is effectively computable. Estimating the 
first integral in (|13p by self-normalised IS as 

E_ , K umi (Xit) 
Wi,t log 7~v V 

and the second one by classical IS, as 

N 

l/ N '^2 7r unn (X itt )/q( a t,N ye t,N)(Xij), 
i=l 

indeed shows that exp(H t,N )/N is a consistent estimator of exp[— £(ir, q^ a t,N ^t,N^)]. The entropy of 
the IS weights is frequently used as a criterio n for assessing the quality of an IS sample — together 



tne lb weignts is frequently used as a criterio n lor assessing tne quality 01 an is sample — togetne 
with the so-called Effective Sample Size (ESS) (jChen and Liul . ll99fil . iDoucet et all , bOQll . lcappe et al 



To the best of our knowledge, however the strong connection between this criterion and the 
performance measure (B(ir,q^ a t,N ^gt,N^) used in the present work had not been noted before. 

Variance Estimation 

For the sake of completeness, we recall here the formula by which it is possible to estimate, from the IS 
sample, the asymptotic variance of the IS estimate. If one considers a test function h of interest, the 
self-normalised IS estimation of its expectation under it is 7r(/i) = J2iLi&ih(Xi) and its asymptotic 
variance is given by 

V (h) = J {h(x) - ir(h)} 2 ir 2 (x)/q at g(x)dx , 

under the assumption that f (1 + h 2 (x))'K 2 {x)/q a fi{x)dx < oo. The asymptotic variance v(h) may 
thus be consistently estimated by NY^iL\ ^f{h(Xi) — 7r(/i)} 2 ( Geweke . 1989h . 



3 The Gaussian mixture case 

As a first example, we consider the case of p-dimensional Gaussian mixture IS densities of the form 

q d (X; 6 d ) = {(2tt) p |S d |}~ 1/2 exp 1~{X - ^^{X - fx d )\ , 

where d = (fi d , T, d ) denotes the parameters of the d-th. Gaussian component density. This parametri- 
sation of the IS density provides a general framework for approximating multivariate targets tt and 
the corresponding algorithm is a straightforward instance of the general framework discussed in the 
previous section. 

3.1 Update formulas 

The integrated update formulas are obtained as the solution of 

6 d +1 ' N = argminE* a*, 0*) (log |S d | + (X - fi d ) T Z d \X - . 
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It is straightforward to check that the infimum is reached when, for d € {!,..., D}, 



and 



t+1 = g [pdjX-a^e^X] 
^ - EX[p d (X;a',0*)] ' 



rf " E*[p d (X;a',0')] 



At iteration i of the M-PMC algorithm, both the numerator and the denominator of each of the 
above expressions are approximated using self-normalised importance sampling. Denoting %{Zi t = d} 
by £i,t, the following empirical update equations are obtained for the basic updating strategy (fT2|) : 



N 

t+l,N 



i=l 

- t v N 
t+l,N Z^i=l w j,isi,t A i,* - t v / t+l,N 

= - ~, = 2^ u} i,&,tXi,t / a d 

A 7 " 

S d = 2^^.*^.*^.* )( X i,t-»d > l a d ■ ( 14 ) 

i=l 

For the Rao-Blackwellised scheme of Algorithm [TJ the update is formally identical to the one above 
upon replacing £a by its conditional expectation 

t«f = p d (X i>t ;a t ' N ,9 t ' N ). (15) 

Note that, as discussed in Section [2.21 establishing the convergence of the parameter update in this 
Gau ssian setting will o nly require the assumption that pd(x;a,9)x 2 is integrable with respect to it 
(see IDquc et alll2007al ) . 



3.2 A simulation experiment 

To illustrate the results of the algorithm presented above, we consider a toy example in which the 
target density consists of a mixture of two multivariate Gaussian densities. The appeal of this example 
is that it is sufficiently simple to allow for an explicit characterisation of the attractive points for the 
adaptive procedure, while still illustrating the variety of situations found in more realistic applications. 
In particular, the model contains an attractive point that does not correspond to the global minimum 
of the entropy criterion as well as some regions of attraction that can eventually lead to a failure of 
the algorithm. The results obtained on this example also illustrate the improvement brought by the 
Rao-Blackwellised update formulas in (fT5|) . 

The target ir is a mixture of two p-dimensional Gaussian densities such that 

tt(x) = 0.5M(x; — su p , I p ) + 0.5M(x; su p , I p ) , 

when Up is the p-dimensional vector whose coordinates are equal to 1 and l p stands for the identity 
matrix. In the sequel, we focus on the case where p = 10 and s = 2. Note that one should not 
be misled by the image given by the marginal densities of tt: in the ten dimensional space, the two 
components of tt are indeed very far from one another. It is for instance straightforward to check that 
the Kullback-Leibler divergence between the two components of tt, D {J\f(su p , I p )|| N(— su p , I p )}, is 
equal to ^||2su p || 2 = 2s 2 p, that is 80 in the case under consideration. In particular, were we to use 
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one of the components of the mixture as an IS density for the other, we know from the arguments 
exposed at the end of Section [2] that the normalised perplexity of the weights would eventually tend 
to exp(— 80). This number is so small that, for any feasible sample size, using one of the component 
densities of ir as an IS instrumental density for the other component or even for tt itself can only 
provide useless biased estimates. 

The initial IS density go is chosen here as the isotropic ten-dimensional Gaussian density with a 
covariance matrix of 5I p . The performances of qo as an importance sampling density, when compared 
to various other alternatives, are fully detailed in Table Q] below but the general comment is that it 
corresponds to a poor initial guess which would provide highly variable results when used with any 
sample size under 50, 000. 



Proposal 


N-PERP 


N-ESS 


a 2 On) 


qo t 


6.5E-4 


1.5E-4 


37E3 


Best fitting Gaussian t 


0.31 


0.27 


19 


Target mixture ' 


1 t 


1 t 


5t 


Best fitting Gaussian (defensive option) 


0.28 


0.23 


22 


Best fitting two Gaussian mixture (defensive option) 


0.89 


0.87 


5.8 



Table 1: Performance of various importance sampling densities in terms of N-PERP: Normalised 
perplexity; N-ESS: Normalised Effective Sample Size; a 2 {x\): Asymptotic variance of self-normalised 
IS estimator for the coordinate projection function h(x) = x\. Quantities marked with a dagger sign 
are straightforward to determine, all others have been obtained using IS with a sample size of one 
million. 

In addition to figures related to the initial IS density qo, Table Q] also reports performance obtained 
with the best fitting Gaussian IS density (with respect to the entropy criterion), which is straightfor- 
wardly obtained as the centred Gaussian density whose covariance matrix matches the one of n, that 
is, I p + s 2 u p Up. Of course the best possible performance achievable with a mixture of two Gaussian 
densities, always with the entropy criterion, is obtained when using tt as an IS density (second line 
of Tabled]). Finally both final lines of Table Q] report the best fit obtained with IS densities of the 
form 0.9 ^2^=1 a dM{p L di ^d) + O.lgo(') when, respectively, D = 1 and D = 2 (further comments on 
the use of these are given below). As a general comment on Table [H note that the variations of the 
perplexity of the IS weights, of the ESS and of the asymptotic variance of the IS estimate for the 
coordinate projection function are very correlated. This is a phenomenon that we have observed on 
many examples and which justifies our postulate that minimising the entropy criterion does provide 
very significant variance reductions for the IS estimate of "typical" functions of interest. 

In this example, one may categorise the possible outcomes of adaptive IS algorithms based on 
mixtures of Gaussian IS densities into mostly four situations: 

Disastrous (D.) After T iterations of the M-PMC scheme, qr a T gT\ is not a valid IS density (in 
the sense that the importance sampling unbiasedness property does not hold due to support 
restrictions) and may lead to inconsistent estimates. Typically, this may happen if q^ a T 8 T^ 
becomes much too peaky with light tails. As discussed above, it will also practically be the 
case if the algorithm only succeeds in fitting qr a T^T\ to one of both Gaussian modes of tt. An- 
other disastrous outcome is when the direct application of the adaptation rules described above 
leads to numerical problems, usually due to the poor conditioning of some of the covariance 
matrices S^. Rather than fixing these issues by ad-hoc solutions (eg. diagonal loading), which 
could nonetheless be useful in practical applications, we consider below more principled ways 
of making the algorithm more resistant to such failures. 
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Mediocre (M.) After adaptation, q/ a T gT\ is not significantly better than qo in terms of the perfor- 
mance criteria displayed in Table [1] and, in this case, the adaptation is useless. 

Good (G.) After T iterations, q^ a T^r^ selects the best fitting Gaussian approximation (second line 
of Table [TJ which already provides a very substantial improvement as it results in variance 
reductions by about four orders of magnitude for typical functions of interest. 

Excellent (E.) After T iterations, q( a T ,e T ) selects the best fitting mixture of two Gaussian densities, 
which in this somewhat artificial example corresponds to a perfect fit of ir. Note, however that, 
the actual gain over the previous outcome is rather moderate with a reduction of variance by a 
factor less than four. 

Of course, a very important parameter here is the IS sample size N: for a given initial IS density 
qo, if A is too small, any method based on IS is bound to fail, conversely when A" gets large all 
reasonable algorithms are expected to reach either the G. or E. result. Note that with local adaptive 
rules such as the ones proposed in this paper, it is not possible to guarantee that only the E. outcome 
will be achieved as the best fitting Gaussian IS density is indeed a stationary point (and in fact a 
local minimum) of the entropy criterion. So, depending on the initialisation, there always is a non 
zero probability that the algorithm converges to the G. situation only. 

To focus on situations where algorithmic robustness is an issue, we purposely chose to select a 
rather small IS sample size of N = 5, 000 points. As discussed above, direct IS estimates using qo as 
IS density would be mostly useless with such a modest sample size. We evaluated four algorithmic 
versions of the M-PMC algorithm. The first, Plain M-PMC, uses the parameter update formulas 
in (|14p and qo is only used as an initialisation value, which is common to all D components of the 
mixture (which also initially have equal weights). Only the means of the components are slightly 
perturbed to make it possible for the adaptation procedure to actually provide distinct mixture 
components. One drawback of the plain M-PMC approach is that we do not ensure during the course 
of the algorithm that the adapted mixture IS density remains appropriate for IS approximations, in 
particular that it provides reliable estimates of the parameter update formulas. To guarantee that 
the IS weights stay well behaved, we consider a version of the M-PMC algorithm in which the IS 
density is of the form 

D 

(1 - a ) ^ OLdN{iidi £ d ) + a q 
d=i 

with the difference that ceo is a fixed parameter which is not adapte d . The aim of this version, which 



we call Defensive M-PMC in reference to the work of iHesterbera (|1995l ). is to guarantee that the 
importance function remains bounded by Oq 1 7r(x)/qo(x), whatever happens during the adaptation, 
thus guaranteeing a finite variance. Since qo is a poor IS density, it is preferable to keep ao as low 
as possible and we used ao = 0.1 in all the following simulations. As detailed in both last lines of 
Table [TJ this modification will typically slightly limit the performances achievable by the adaptation 
procedure, although this drawback could probably be avoided by allowing for a decrease of ao during 
the iterations of the M-PMC. The parameter update formulas for this modified mixture model are 
very easily deduced from (|14p and are omitted here for the sake of conciseness. The third version we 
considered is termed Rao-Blackwellised M-PMC and consists in replacing the update equations (114p 
by their Rao-Blackwellised version (|15[) . Finally, we consider a fourth option in which both the 
defensive mixture density and the Rao-Blackwellised update formulas are used. 

All simulations were carried out using a sample size of Af = 5, 000, 20 iterations of the M-PMC 
algorithm and Gaussian mixtures with D = 3 components. Note that we purposely avoided to chose 
D = 2 to avoid the very artificial "perfect fit" phenomenon. This also means that for most runs of 
the algorithm, at least one component will disappear (by convergence of its weight to zero) or will be 
duplicated, with several components sharing very similar parameters. 
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Disastrous 


Mediocre 


Good 


Excellent 


Plain 


55 





33 


12 


Defensive 


13 


51 


30 


6 


R.-B. 


18 


1 


70 


11 


Defensive + R.-B. 


5 


11 


76 


8 



Table 2: Number of outcomes of each category for the four algorithmic versions, as recorded from 
100 independent runs. 



Table [2] display the performance of the four algorithms in repeated independent adaptation runs. 
The most significant observation about Table [2] is the large gap in robustness between the non Rao- 
Blackwellised versions of the algorithm, which returned disastrous or mediocre results in about 60% 
of the cases, a fraction that falls bellow 20% when the Rao-Blackwellised update formulas are used. 
Obviously the fact that the Rao-Blackwellised updates are based on all simulated values and not just 
on those actually simulated from a particular mixture component is a major source of robustness of 
the method when the sample size N is small, given the misfit of the initial IS density q^. The same 
remark also applies when the M-PMC algorithm is to be implemented with a large number D of 
components. The role of the defensive mixture component is more modest although it does improve 
the performance of both versions of the algorithm (non Rao-Blackwellised and Rao-Blackwellised 
altogether), at the price of a slight reduction of the frequency of the "Excellent" outcome. Also 
notice that the results obtained when the defensive mixture component is used are slightly beyond 
those of the unconstrained adaptation (see Table [1]). The frequency of the perfect or "Excellent" 
match is about 10% for all methods but this is a consequence of the local nature of the adaptation 
rule as well as of the choice of the initialisation of the algorithm. It should be stressed however 
that as we are not interested in modelling ir by a mixture but rather that we are seeking good IS 
densities, the solutions obtained in the G. or E. situations are only mildly different in this respect 
(see Tabled]). As a final comment, recall that the results presented above have been obtained with a 
fairly small sample size of iV = 5, 000. Increasing iV quickly reduces the failure rate of all algorithms: 
for iV = 20,000 for instance, the failure rate of the plain M-PMC algorithm drops to 7/100 while 
the Rao-Blackwellised versions achieve either the G. or E. result (and mostly the G. one, given the 
chosen initialisation) for all runs. 



4 Robust ificat ion via mixtures of multivariate i's 



We now consider the setting of a proposal composed of a mixture of p-dimensional t distributions, 

D 



(16) 



d=l 



We here follow the recommendations of West ( 1992 ) and Oh and Berger (jl993l ) who proposed using 
mixtures of t distributions in importance sampling. The t mixture is preferable to a normal mixture 
because of its heavier tails that can capture a wider range of non-Gaussian targets with a smaller 
number of components. This alternative setting is more challenging however and one must take 
advantage of the missing variable representation of the t distribution itself to achieve a closed-form 
updating of the parameters (fi d , T, d ) d approximating (jlip . since a true closed-form cannot be derived. 
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4.1 The latent-data framework 



Using the classical normal/chi-squared decomposition of the t distribution, a joint distribution asso- 
ciated with the t mixture proposal (|16p is 

S(x, y, z) oc a e \E z \-^ 2 exp {-(x - ^fX'^x - » z )y/2v z } y (»*+P)/*-i e -v/2 
oc a z ip(x; fj,g, u z T, z /y) u z /2, 1/2) , 

where, as above, x corresponds to the observable in (|16|) . z corresponds to the mixture indicator, and 
y corresponds to the xt completion. The normal density is denoted by 99 and the gamma density by 
c;. Both y and z correspond to latent variables in that the integral of the above in (y,z) returns (|16j) . 

In the associated M-PMC algorithm, we only update the expectations and the covariance struc- 
tures of the t distributions and not the number of degrees of freedom, given that there is no closed-form 
solution for the later. In that case, 6 a = (p-d, ^d) and, for each d = 1, . . . , D, the number of degrees 
of freedom v d is fixed. At iteration t, the integrated EM update of the parameter will involve the 
following "E" function 



x 



Q{(a t ,9 t ),(a,9)} = E* 



E 



Y,Z 
(a*,0*) 



{log(a z ) + logfapf; fiz, vz^ z /Y))\X} 



since the x 2 part does not involve the parameter 9 = (p, E). Given that 



Y, Z\X, 9 ~ S(y, z\x) oc a z ip{x; p z , v z Y, z /y) q(y; v z /2, 1/2) 



we have that 



and therefore 



Y\X,Z = d,9 ~Qa 



(u d + p)/2, - {1 + (X - VdfEjHX ~ Vd)/vd} 



x 



Q{(a t ,9 t ),(a,9)} = E* 



D 



d=l 

IE; 



J> d (X;a*,0*)log(a'J 

D r 

Pd(X; a\9 l )l log |S d | + (X - ^f^d'^X - p d ) 
d=i ^ 



Vd+P 



u d + (x-^^ d )-Hx-^ d ) 

where we have used both the notation, 

a d t(x;v d ,fM d ,?, d ) 



P d(X;a t ,9 t )=F a t t0t (Z = d\X) 



J2eLi a\t{x-V(i,p\,Y>\) 
with t(x; u, p, £) denoting the T(y, p, £) density, and the fact that 

lt\ twY rv/.. IV rr jl "d+P 



ld (X;9 t )=E r et {Y/u d \X,Z = d} 



^+(X- M *)T (E t)-l (X _ M t ) - 

Therefore, the "M" step of the integrated EM update is 

- [pd(X; a', #')] , 



K +1 



E* [pdjX^^^djX;^] 

E^[p d (X;a i ,^) 7d (X;^)] ' 
E* [p d (X; a', fl') 7 d(*; fl')(JT - p d +1 )(X - ^ +1 ) T ] 
E* [p d (X;a\9t)] 
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While the first update is the generic weight modificati on (1101). the latter formula e are (up to the 
integration with respect to X) essentially those found in lPee~nd McLachlar] «) for a mixture of 
t distributions. 



4.2 Parameter update 

As in Section ETTl the empirical update equations are obtained by using self-normalised IS with weights 
Qi t t given by © for both the numerator and the denominator of each of the above expressions. The 
Rao-Blackwellised approximation based on (|8|) yields 



N 

t+l,N - „ (v . „.t,N Q t,N\ 



i=l 



t+i,N = Eli *i,t Pd(X iit ; a^ N , e*> N ) ld (X iit ; 0^) X i<t 



E 



t+i,N _ Z^i=i ^i,tPd{X iit ,a ■ ,U< )j d {X ijt ;V > ) {X ijt - fj, d ){X i;t - fi d ) 



d 



Eli^,tPd(X ht ;a^,d^) 

while the standard update equations, based on (fT2j) . are obtained by replacing p d {Xi^a t,N ,9 t,N ) by 
t{Xi :t = d} in the above equations. 



4.3 Pima Indian example 

As a realistic if artificial illustration of the performances of the t mixture (]16j) . we study the posterior 
distribution of the parameters of a probit model. Th e corresponding dataset is borrowed from the 
MASS library of R (|B, Development Core Team! l200fih . It consists in the records of 532 Pima Indian 



women who were tested by the U.S. National Institute of Diabetes and Digestive and Kidney Diseases 
for diabetes. Four quantitative covariates were recorded, along with the presence or absence of 
diabetes. The corresponding probit model analyses the presence of diabetes, i.e. 

Fp(y = l|x) = 1 - F p (y = 0|x) = + x T (Pi, P2, 03, Pi)) 

with = (0q, . . . , 0±), x made of four covariates, the number of pregnancies, the plasma glucose 
concentration, the body mass index weight in kg/(height in m) 2 , and the age, and <E> corresponds 
to the cumulative distribution function of the standard normal. We use the flat prior distribution 
7r(/3|X) oc 1; in that case, the 5-dimensional target posterior distribution is such that 

532 

vr(/3|y,X) oc ft [HPo + (^) T (0u 02,0s, fa)}] W [l - HPo + (x*) T (/3i, 02, 0s, A)} 

i=l 

where x* is the value of the covariates for the i-th individuals and is the response of the i-th 
individuals. 

We first present some results for N = 10, 000 sample points and T = 500 iterations on Figures [1] — 
based on a mixture with 4 components and with the degrees of freedom chosen as v = (3, 6, 9, 18), 
respectively, when using the non Rao-Blackwellised version (|12|) . The unrealistic value of T is cho- 
sen purposely to illustrate the lack of stability of the update strategy when not using the Rao- 
Blackwellised version. Indeed, as can be seen from Figure [IJ which describes the evolution of the 
/id's, some components vary quite widely over iterations, but they also correspond to a rather stable 
overall estimate of 0, ^i,TP i,T , equal to (-5.54,0.051,0.019,0.055,0.022) over most iterations. 
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Figure 1: Pima Indians: Evolution of the components of the five fx^s over 500 iterations plotted 
by pairs: (clockwise from upper left side) (1,2), (3,4), (4,1) and (2,3). The colour code is blue for 
fix, yellow for /j, 2 , brown for ^3 and red for /x 4 . The additional dark path corresponds to the estimate 
of j3. All /x^'s were started in the vicinity of the MLE (3. 




Figure 2: Pima Indians: Evolution of the five S^'s over 500 iterations plotted by pairs for the 
diagonal elements: (clockwise from upper left side) (1,2), (3,4), (4,1) and (2,3). The colour code 
is blue for Si, yellow for £2, brown for S3 and red for £4. All S^'s were started at the covariance 
matrix of (3 produced by R glm() procedure. 




100 200 300 400 500 



Figure 3: Pima Indians: Evolution of the cumulated weights (top) and of the estimated entropy 
divergence K n [log(q aj g((3))] (bottom). 
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When looking at Figure O the quasi-constant entropy estimate after iteration 100 or so shows that, 
even in this situation, there is little need to perpetuate the iterations till the 500-th. 

Using a Rao-Blackwellised version of the updates shows a strong stabilisation for the updates 
of the parameters ad and (/x^, Ed), both in the number of iterations and in the range of the pa- 
rameters. The approximation to the Bayes estimate is obviously very close to the above estimation 
(—5.63,0.052,0.019,0.056,0.022). Figures H] and [5] show the immediate stabilisation provided by the 
Rao-Blackwellisation step. In this example, which is quite typical in this respect, we recommend to 
use less than T = 10 iterations in order to reserve most of the computational effort for increasing 
N, which is essential during the first adaptation steps (because the initial IS density is poor) and for 
the accuracy of the IS approximation in the final steps of the algorithm. Comparing the plain and 
Rao-Blackwellised update formulas, will really depend on how costly the parameter update is — and 
thus on the dimension of the model — compared to the other computational costs, and in particular 
the evaluation of the likelihood, which mostly depends on the number of observations. In the present 
case, the increase in run-time due to the use the Rao-Blackwellised formulas instead of the plain ones 
is negligible. 



5 Conclusions 

The M-PMC algorithm provides a flexible and robust framework for adapting general importance 
sampling densities represented as mixtures. The extension to mixtures of t distribution broadens the 
scope of the method b y allowing approxim ation of heavier tail targets. Moreover, we can extend here 
the remarks made in bouc et al.1 d2007al lbh. namely that the update mechanism provides an early 



stabilisation of the parameters of the mixture. It is therefore unnecessary to rely on a large value of 
T: with large enough sample sizes N at each iteration — especially on the initial iteration that requires 
many points to counter-weight a potentially poor initial proposal — , it is quite uncommon to fail to 
spot a stabilisation of both the estimates and of the entropy criterion within a few iterations. 

While this paper relies on the generic entropy criterion to update the mixture density, we want 
to stress that it is also possible to use a more focussed deviance criterion, namely the /i-entropy 

<£/i(tt, <?(«,<?)) = D(ir h \\q^ e) ) , 

with 

ir h (x) oc \h(x) - Tr(h)\ir(x) , 

that is tuned to the estimation of a particular function h, as it is well-known that the optimal choice 
of the importance density for the self- normalised importance sampling estimator is exactly ir^. Since 
the normalising constant in tt^ does not need to be known, one can derive an adaptive algorithm 
which resembles the method presented in this paper. It is expected that this modification will be 
helpful in reaching IS densities that provide a low approximation error for a specific function h, which 
is also a desirable feature of importance sampling in several applications. 
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Figure 4: Pima Indians: Evolution of the components of the five over 50 Rao-Blackwellised 
iterations plotted by pairs: (clockwise from upper left side) (1,2), (3,4), (4, 1) and (2,3). The colour 
code is blue for fix, yellow for fj,2, brown for 113 and red for ^4. The additional dark path corresponds 
to the estimate of (3. All ^'s were started in the vicinity of the MLE (3. 
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Figure 5: Pima Indians: Evolution of the cumulated weights (top) and of the estimated entropy 
divergence E 7r [log(g Qj g(/3))] (bottom) for the Rao-Blackwellised version. 



16 



Chen, R. and Liu, J. S. (1996). Predictive updating method and Bayesian classification. J. Royal 
Statist. Soc. Series B, 58(2):397-415. 

Douc, R., Guillin, A., Marin, J.-M., and Robert, C. (2007a). Convergence of adaptive mixtures of 
importance sampling schemes. Ann. Statist., 35(l):420-448. 

Douc, R., Guillin, A., Marin, J.-M., and Robert, C. (2007b). Minimum variance importance sampling 
via population Monte Carlo. ESAIM: Probability and Statistics, 11:427-447. 

Doucet, A., de Freitas, N., and Gordon, N. (2001). Sequential Monte Carlo Methods in Practice. 
Springer- Verlag, New York. 

Geweke, J. (1989). Bayesian inference in econometric models using Monte Carlo integration. Econo- 
metrica, 57:1317-1340. 

Hesterberg, T. (1995). Weighted average importance sampling and defensive mixture distributions. 
Technometrics, 37(2): 185-194. 

Oh, M. and Berger, J. (1993). Integration of multimodal functions by Monte Carlo importance 
sampling. J. American Statist. Assoc., 88:450-456. 

Peel, D. and McLachlan, G. (2000). Robust mixture modelling using the t distribution. Statistics 
and Computing, 10:339-348. 

R Development Core Team (2006). R: A Language and Environment for Statistical Computing. R 
Foundation for Statistical Computing, Vienna, Austria. 

Robert, C. and Casella, G. (2004). Monte Carlo Statistical Methods. Springer- Verlag, New York, 
second edition. 

Rubinstein, R. Y. and Kroese, D. P. (2004). The Cross-Entropy Method. Springer- Verlag, New York. 

West, M. (1992). Modelling with mixtures. In Berger, J., Bernardo, J., Dawid, A., and Smith, A., 
editors, Bayesian Statistics 4, pages 503-525. Oxford University Press, Oxford. 



17 



